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ABSTRACT 

In  this  paper,  we  propose  a  stochastic  search  algorithm  for  solving  general  optimization 
problems  with  little  structure.  The  algorithm  iteratively  finds  high  quality  solutions  by 
randomly  sampling  candidate  solutions  from  a  parameterized  distribution  model  over  the 
solution  space.  The  basic  idea  is  to  convert  the  original  (possibly  non-differentiable)  prob¬ 
lem  into  a  differentiable  optimization  problem  on  the  parameter  space  of  the  parameterized 
sampling  distribution,  and  then  use  a  direct  gradient  search  method  to  find  improved  sam¬ 
pling  distributions.  Thus,  the  algorithm  combines  the  robustness  feature  of  stochastic 
search  from  considering  a  population  of  candidate  solutions  with  the  relative  fast  conver¬ 
gence  speed  of  classical  gradient  methods  by  exploiting  local  differentiable  structures.  We 
analyze  the  convergence  and  converge  rate  properties  of  the  proposed  algorithm,  and  carry 
out  numerical  study  to  illustrate  its  performance. 

1.  Introduction 

We  consider  global  optimization  problems  over  real  vector-valued  domains.  These  opti¬ 
mization  problems  arise  in  many  areas  of  importance  and  can  be  extremely  difficult  to 
solve  due  to  the  presence  of  multiple  local  optimal  solutions  and  the  lack  of  structural 
properties  such  as  differentiability  and  convexity.  In  such  a  general  setting,  there  is  little 
problem-specific  knowledge  that  can  be  exploited  in  searching  for  improved  solutions,  and 
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it  is  often  the  case  that  the  objective  function  can  only  be  assessed  through  the  form  of 
“black-box”  evaluation,  which  returns  the  function  value  for  a  specified  candidate  solution. 

An  effective  and  promising  approach  for  tackling  such  general  optimization  problems 
is  stochastic  search.  This  refers  to  a  collection  of  methods  that  use  some  sort  of  ran¬ 
domized  mechanism  to  generate  a  sequence  of  iterates,  e.g.,  candidate  solutions,  and  then 
use  the  sequence  of  iterates  to  successively  approximate  the  optimal  solution.  Over  the 
past  years,  various  stochastic  search  algorithms  have  been  proposed  in  literature.  These 
include  approaches  such  as  simulated  annealing  [10],  genetic  algorithms  [7],  tabu  search 
[6],  pure  adaptive  search  [28],  and  sequential  Monte  Carlo  simulated  annealing  [29],  which 
produce  a  sequence  of  candidate  solutions  that  are  gradually  improving  in  performance; 
the  nested  partitions  method  [25],  which  uses  a  sequence  of  partitions  of  the  feasible  region 
as  intermediate  constructions  to  find  high  quality  solutions;  and  the  more  recent  class  of 
model-based  algorithms  (see  a  survey  by  [30]),  which  construct  a  sequence  of  distribution 
models  to  characterize  promising  regions  of  the  solution  space. 

This  paper  focuses  on  model-based  algorithms.  These  algorithms  typically  assume  a 
sampling  distribution  (i.e.,  a  probabilistic  model),  often  within  a  parameterized  family  of 
distributions,  over  the  solution  space,  and  iteratively  carry  out  the  two  interrelated  steps: 
(1)  draw  candidate  solutions  from  the  sampling  distribution;  (2)  use  the  evaluations  of  these 
candidate  solutions  to  update  the  sampling  distribution.  The  hope  is  that  at  every  iteration 
the  sampling  distribution  is  biased  towards  the  more  promising  regions  of  the  solution  space, 
and  will  eventually  concentrate  on  one  or  more  of  the  optimal  solutions.  Examples  of  model- 
based  algorithms  include  ant  colony  optimization  [4,  3],  annealing  adaptive  search  (AAS) 
[22],  probability  collectives  (PCs)  [27],  the  estimation  of  distribution  algorithms  (ED As) 
[14,  19],  the  cross-entropy  (CE)  method  [23],  model  reference  adaptive  search  (MRAS) 
[8],  and  the  interacting-particle  algorithm  [17,  18].  The  various  model-based  algorithms 
mainly  differ  in  their  ways  of  updating  the  sampling  distribution.  Recently,  [9]  showed 
that  the  updating  schemes  in  some  model-based  algorithms  can  be  viewed  under  a  unified 
framework.  The  basic  idea  is  to  convert  the  original  optimization  problem  into  a  sequence 
of  stochastic  optimization  problems  with  differentiable  structures,  so  that  the  distribution 
updating  schemes  in  these  algorithms  can  be  equivalently  transformed  into  the  form  of 
stochastic  approximation  procedures  for  solving  the  sequence  of  stochastic  optimization 
problems. 

Because  model-based  algorithms  work  with  a  population  of  candidate  solutions  at  each 
iteration,  they  demonstrate  more  robustness  in  exploring  the  solution  space  as  compared 
with  their  classical  counterparts  that  work  with  a  single  candidate  solution  each  time  (e.g., 
simulated  annealing).  The  main  motivation  of  this  paper  is  to  integrate  this  robustness 
feature  of  model-based  algorithms  into  familiar  gradient-based  tools  from  classical  differen¬ 
tiable  optimization  to  facilitate  the  search  for  good  sampling  distributions.  The  underlying 
idea  is  to  reformulate  the  original  (possibly  non-differentiable)  optimization  problem  into  a 
differentiable  optimization  problem  over  the  parameter  space  of  the  sampling  distribution, 
and  then  use  a  direct  gradient  search  method  on  the  parameter  space  to  solve  the  new 
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formulation.  This  leads  to  a  natural  algorithmic  framework  that  combines  the  advantages 
of  both  methods:  the  fast  convergence  of  gradient-based  methods  and  the  global  explo¬ 
ration  of  stochastic  search.  Specifically,  each  iteration  of  our  proposed  method  consists 
of  the  following  two  steps:  (1)  generate  candidate  solutions  from  the  current  sampling 
distribution;  (2)  update  the  parameters  of  the  sampling  distribution  using  a  direct  gra¬ 
dient  search  method.  Although  there  are  a  variety  of  gradient-based  algorithms  that  are 
applicable  in  step  (2)  above,  in  this  paper  we  focus  on  a  particular  algorithm  that  uses  a 
quasi-Newton  like  procedure  to  update  the  sampling  distribution  parameters.  Note  that 
since  the  algorithm  uses  only  the  information  contained  in  the  sampled  solutions,  it  differs 
from  the  quasi-Newton  method  in  deterministic  optimization,  in  that  there  is  an  extra 
Monte  Carlo  sampling  noise  involved  at  each  parameter  updating  step.  We  show  that  this 
stochastic  version  of  quasi-Newton  iteration  can  be  expressed  in  the  form  of  a  generalized 
Robbins-Monro  algorithm,  and  this  in  turn  allows  us  to  use  the  existing  tools  from  stochas¬ 
tic  approximation  theory  to  analyze  the  asymptotic  convergence  and  convergence  rate  of 
the  proposed  algorithm. 

The  rest  of  the  paper  is  organized  as  follows.  We  introduce  the  problem  setting  for¬ 
mally  in  Section  2.  Section  3  provides  a  description  of  the  proposed  algorithm  along  with 
the  detailed  derivation  steps.  In  Section  4,  we  analyze  the  asymptotic  properties  of  the 
algorithm,  including  both  convergence  and  convergence  rate.  Some  preliminary  numerical 
study  are  carried  out  in  Section  5  to  illustrate  the  performance  of  the  algorithm.  Finally, 
we  conclude  this  paper  in  Section  6.  All  the  proofs  are  contained  in  the  Appendix. 

2.  Problem  Formulation 

Consider  the  maximization  problem 

x*  G  arg  max  i7(x) ,  X  C  Mn.  (1) 

where  the  solution  space  A  is  a  nonempty  compact  set  in  Rn,  and  H  :  X  — >  M  is  a  real¬ 
valued  function.  Denote  the  optimal  function  value  as  H*,  i.e.,  there  exists  an  x*  such  that 
H(x)  <H*  =  H(x*),  Vx  G  X.  Assume  that  H  is  bounded  on  X ,  i.e.,  3Hu,  >  — oo,  Hub  < 
oo  s.t.  Hu,  <  H(x)  <  Hub ,  Vx  G  X.  We  consider  problems  where  the  objective  function 
H (x)  lacks  nice  structural  properties  such  as  differentiability  and  convexity  and  could  have 
multiple  local  optima. 

Motivated  by  the  idea  of  using  a  sampling  distribution/probabilistic  model  in  model- 
based  optimization,  we  let  {f(x;6) \0  G  0  C  Mrf}  be  a  parameterized  family  of  probability 
density  functions  (pdfs)  on  X  with  0  being  a  parameter  space.  Intuitively,  this  collection 
of  pdfs  can  be  viewed  abstractly  as  probability  models  characterizing  our  knowledge  or 
belief  of  the  promising  regions  of  the  solution  space.  It  is  easy  to  see  that 

[  H(x)f{x ;  0)dx  <  H*,  V9  G  0. 
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In  this  paper,  we  simply  write  f  with  the  understanding  that  the  integrals  are  taken  over 
X.  Note  that  the  equality  on  the  righthand  side  above  is  achieved  whenever  there  exists 
an  optimal  parameter  under  which  the  parameterized  probability  distribution  will  assign 
all  of  its  probability  mass  to  a  subset  of  the  set  of  global  optima  of  (1).  Hence,  one  natural 
idea  to  solving  (1)  is  to  transform  the  original  problem  into  an  expectation  of  the  objective 
function  under  the  parameterized  distribution  and  try  to  find  the  best  parameter  9*  within 
the  parameter  space  0  such  that  the  expectation  under  f(x,9*)  can  be  made  as  large  as 
possible,  i.e. , 

9*  =  arg  max  J  H(x)f(x;9)dx.  (2) 

So  instead  of  considering  directly  the  original  function  H (x)  that  is  possibly  non-differentiable 
and  discontinuous  in  x,  we  now  consider  the  new  objective  function  f  H(x)f(x ;  9)dx  that  is 
continuous  on  the  parameter  space  and  usually  differentiable  with  respect  to  9.  For  exam¬ 
ple,  under  mild  conditions  the  differentiation  can  be  brought  into  the  integration  to  apply 
on  the  p.d.f.  /(x;  9),  which  is  differentiable  given  an  appropriate  choice  of  the  distribution 
family  such  as  an  exponential  family  of  distributions. 

The  formulation  of  (2)  suggests  a  natural  integration  of  stochastic  search  methods 
on  the  solution  space  X  with  gradient-based  optimization  techniques  on  the  continuous 
parameter  space.  Conceptually,  that  is  to  iteratively  carry  out  the  following  two  steps: 

1.  Generate  candidate  solutions  from  f(x;9)  on  the  solution  space  X. 

2.  Use  a  gradient-based  method  for  the  problem  (2)  to  update  the  parameter  9. 

The  motivation  is  to  speed  up  stochastic  search  with  a  guidance  on  the  parameter  space, 
and  hence  combine  the  advantages  of  both  methods:  the  fast  convergence  of  gradient-based 
methods  and  the  global  exploration  of  stochastic  search  methods.  Even  though  problem 
(2)  may  be  non-concave  and  multi-modal  in  9,  the  sampling  from  the  entire  original  space 
X  compensates  the  local  exploitation  along  the  gradient  on  the  parameter  space.  In  fact, 
our  algorithm  developed  later  will  automatically  adjust  the  magnitude  of  the  gradient 
step  on  the  parameter  space  according  to  the  global  information,  i.e.,  our  belief  about  the 
promising  regions  of  the  solution  space. 

For  algorithmic  development  later,  we  introduce  a  shape  function  S$  :  M  — >  R+,  where 
the  subscript  9  signifies  the  possible  dependence  of  the  shape  function  on  the  parameter  9. 

The  function  Sg  satisfies  the  following  conditions: 

(a)  For  every  9 ,  Sg(y )  is  nondecreasing  in  y  and  bounded  from  above  and  below  for 
bounded  y,  with  the  lower  bound  being  away  from  zero.  Moreover,  for  every  fixed  y, 

Sg(y )  is  continuous  in  9; 

(b)  The  set  of  optimal  solutions  {arg  maxxeX  Sg(H (x))}  is  a  non-empty  subset  of  {arg  maxx  eXH(x)}, 
the  set  of  optimal  solutions  of  the  original  problem  (1). 
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Therefore,  solving  (1)  is  equivalent  to  solving  the  following  problem 

x*  G  argrna xSg(H(x)).  (3) 

The  main  reason  of  introducing  the  shape  function  Sg  is  to  ensure  positivity  of  the  objective 
function  Sg(H(x))  under  consideration,  since  Sg(H(x))  will  be  used  to  form  a  probability 
density  function  later.  Moreover,  the  choice  of  Sq  can  also  be  used  to  adjust  the  trade-off 
between  exploration  and  exploitation  in  stochastic  search.  One  choice  of  such  a  shape 
function,  similar  to  the  level/indicator  function  used  in  the  CE  method  and  MRAS,  is 

Se(HM)  =  (H(X)  -  (4) 

where  So  is  a  large  positive  constant,  and  70  is  the  (1  —  p)-quantile 

7 fl  —  sup  {l  :  Pe{x  G  X  :  H(x)  >  1}  >  p} , 

l 

where  Pg  denotes  the  probability  with  respect  to  /(•;  8).  Notice  that  \/ is 
a  continuous  approximation  of  the  indicator  function  I{H(x )  >  70},  this  shape  function  Sg 
essentially  prunes  the  level  sets  below  7 g.  By  varying  p,  we  can  adjust  the  percentile  of  elite 
samples  that  are  selected  to  update  the  next  sampling  distribution:  the  bigger  p,  the  less 
elite  samples  selected  and  hence  more  emphasis  is  put  on  exploiting  the  neighborhood  of  the 
current  best  solutions.  Sometimes  the  function  Sg  could  also  be  chosen  to  be  independent 
of  8 ,  i.e.,  Sg  =  S  :  M  — >  R+,  such  as  the  function  S(y)  =  exp(y). 

For  an  arbitrary  but  fixed  8'  G  Rrf,  define  the  function 

me')  4  J  Sgi(H(x))f(x]  8)dx. 

According  to  the  conditions  on  Sg,  it  always  holds  that 

0  <  L(8;  8')  <  Sgi{H*)  V0, 

and  the  equality  is  achieved  if  there  exists  an  optimal  parameter  such  that  the  probability 
mass  of  the  parameterized  distribution  is  concentrated  only  on  a  subset  of  the  set  of  global 
optima.  Following  the  same  idea  that  leads  to  (2),  solving  (3)  and  thus  (1)  can  be  converted 
to  the  problem  of  trying  to  find  the  best  parameter  8*  within  the  parameter  space  by  solving 
the  following  maximization  problem: 

8*  =  argrna xL(8:9').  (5) 

flee 

Same  as  problem  (2),  L(8]8')  may  be  nonconcave  and  multi-modal  in  8. 
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3.  Gradient-Based  Adaptive  Stochastic  Search 


Following  the  formulation  in  the  previous  section,  we  propose  a  stochastic  search  algorithm 
that  carries  out  the  following  two  steps  at  each  iteration:  let  9k  be  the  parameter  obtained 
at  the  kth  iteration, 

1.  Generate  candidate  solutions  from  f(x]9k). 

2.  Update  the  parameter  to  9k+ i  using  a  quasi  Newton’s  iteration  for  max#  L(9\  9k). 

Assuming  it  is  easy  to  draw  samples  from  f(x:  9),  then  the  main  obstacle  is  to  find  expres¬ 
sions  of  the  gradient  and  Hessian  of  L(9 ;  9k)  that  can  be  nicely  estimated  using  the  samples 
from  f(x-,9).  To  overcome  this  obstacle,  we  choose  {f(x;9)}  to  be  an  exponential  family 
of  densities  defined  as  below. 

Definition  1.  A  family  {f(x]  9)  :  9  e  0}  is  an  exponential  family  of  densities  if  it  satisfies 
f(x;  9)  =  exp{07  T(x)  —  <t>{0)},  <f(0)  =  In  exp(0TT(x))iix|  .  (6) 


where  T(x)  =  [Tf  (x) ,  T2  (x) , . . . ,  Td(x)]T  is  the  vector  of  sufficient  statistics,  9  =  [0\,  62, . .  . ,  0d]T 
is  the  vector  of  natural  parameters,  and  0  =  {9  €  :  \<j>{0)\  <  00}  is  the  natural  parame¬ 

ter  space  with  a  nonempty  interior. 

Define  the  density  function 

/  .  a  se(H(x))f(x-,0)  =  Sg(H(x))f(x;0) 

P  1'  J  Sg(H(x))f(x;9)dx  L(0;0) 

With  f{-\9)  from  an  exponential  family,  we  propose  the  following  updating  scheme  for  9 
in  step  2  above: 

9k+1  =  9k  +  afe(Var ek[T(X)}  +  el)~l  (. EPk[T(X )]  -  Egk[T(X)]) ,  (8) 


where  e  >  0  is  a  small  positive  number,  ak  >  0  is  the  step  size,  EPk  denotes  the  expectation 
with  respect  to  p(-;  9k),  and  Egk  and  Var gk  denote  the  expectation  and  variance  taken  with 
respect  to  f{-\9k ),  respectively.  The  role  of  el  is  to  ensure  the  positive  definiteness  of 
(Var gk[T(X)}  +  el)  such  that  it  can  be  inverted.  The  term  (EPk[T(X)\  —  Egk[T(X)})  is  an 
ascent  direction  of  L(9\  9k),  which  will  be  shown  in  the  next  section. 

To  implement  the  updating  scheme  (8),  the  term  EPk[T(X)\  is  often  not  analytically 
available  and  needs  to  be  estimated.  Suppose  {aq, . . . ,  XNk}  are  independent  and  identically 
distributed  (i.i.d.)  samples  drawn  from  f(x\9k )■  Since 


EPk[T(X)]  =  Eek 


T(X) 


p(X-9ky 

f(X-,Ok)_ 
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we  compute  the  weights  {'wj.}  for  the  samples  { xlk }  according  to 

i  p(xk’^k)  Q  /  £77  i \\  •  1  AT 

Wk(X  t(  i.  a  Sek(H(xk)),  i  =  l,...,Nk, 

N 

J2w^  =  1- 

i= 1 

Hence,  EPk[T(X)\  can  be  approximated  by 

Nk 

Epk[T{X)}  =  YJwlT{  4)-  (9) 

i=l 

Some  forms  of  the  function  Sgk(H(x))  have  to  be  approximated  by  samples  as  well.  For 
example,  if  Sgk(H(x))  takes  the  form  (4),  the  quantile  ^gk  needs  to  be  estimated  by  the 
sample  quantile.  In  this  case,  we  denote  the  approximation  by  Sgk(H(x)),  and  evaluate 
the  normalized  weights  according  to 

Wi  oc  Sok  i  =  1, . . . ,  Nk. 

Then  the  term  EPk[T(X)\  is  approximated  by 

Nk 

EPk[T(X)}  =  ^2wlT(x{).  (10) 

i—  1 

In  practice,  the  variance  term  Var^,.  [T(X) ]  in  (8)  may  not  be  directly  available  or  could  be 
too  complicated  to  compute  analytically,  so  it  also  often  needs  to  be  estimated  by  samples: 

Nk  /  Nk  \  /  Nk  \  ^ 

vm-oJUx)]  =  T  x;  n4m4 f  -  72  _  jv  E  r<4)  E  r<4)  in) 

i=  1  ^  \  *=1  /  \  j=l  / 

The  expectation  term  Egk[T(X)\  can  be  evaluated  analytically  in  most  cases.  For  example, 
if  {/(.;0fc)}  is  chosen  as  the  Gaussian  family,  then  Egk[T(X)\  reduces  to  the  mean  and 
second  moment  of  the  Gaussian  distribution. 

Based  on  the  updating  scheme  of  0,  we  propose  the  following  algorithm  for  solving  (1). 
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Algorithm  1  Gradient-Based  Adaptive  Stochastic  Search  (GASS) 

1.  Initialization:  choose  an  exponential  family  of  densities  {/(•;  0)},  and  specify  a  small 
positive  constant  e,  initial  parameter  9q,  sample  size  sequence  {-/V^},  and  step  size 
sequence  {ak}-  Set  k  =  0. 

2.  Sampling:  draw  samples  x\  ~  /(x;  9k),i  =  1,2,...,  Nk. 

3.  Estimation:  compute  the  normalized  weights  wlk  according  to 

and  then  compute  EPk[T(X )]  and  Var gk[T(X)\  respectively  according  to  (10)  and 

(11)- 

4.  Updating:  update  the  parameter  9  according  to 

9k+1  =  rig  { 9k  +  ak(Sfaok[T(X)]  +  eI)-\EPk[T(X )]  -  Egk[T(X)])}  , 

where  0  C  0  is  a  non-empty  compact  connected  constraint  set,  and  Ilg  denotes 
the  projection  operator  that  projects  an  iterate  back  onto  the  set  0  by  choosing  the 
closest  point  in  0. 

5.  Stopping:  check  if  some  stopping  criterion  is  satisfied.  If  yes,  stop  and  return  the 
current  best  sampled  solution;  else,  set  k  :=  k  +  1  and  go  back  to  step  2. 


In  the  above  algorithm,  at  the  kth  iteration  candidate  solutions  are  drawn  from  the 
sampling  distribution  f(-‘,9k),  and  then  are  used  to  estimate  the  quantities  in  the  updating 
equation  for  9k  so  as  to  generate  the  next  sampling  distribution  /(■;  9k+\).  To  develop 
an  intuitive  understanding  of  the  algorithm,  we  consider  the  special  setting  T(X)  =  X , 
in  which  case  the  term  Var Qk[T(X)\  basically  measures  how  widespread  the  candidate 
solutions  are.  Since  the  magnitude  of  the  ascent  step  is  determined  by  (Var#fc  [T(A)]+e/)_1, 
the  algorithm  takes  smaller  ascent  steps  to  update  9  when  the  candidate  solutions  are  more 
widely  spread  (i.e.,  XaiQk[X]  is  larger),  and  takes  larger  ascent  steps  when  the  candidate 
solutions  are  more  concentrated  (i.e.,  Vargfe[A]  is  smaller).  It  means  that  exploitation  of 
the  local  structure  is  adapted  to  our  belief  about  the  promising  regions  of  the  solution 
space:  we  will  be  more  conservative  in  exploitation  if  we  are  uncertain  about  where  the 
promising  regions  are  and  more  progressive  otherwise.  Note  that  the  projection  operator 
at  step  4  is  primarily  used  to  ensure  the  numerical  stability  of  the  algorithm.  It  prevents 
the  iterates  of  the  algorithm  from  becoming  too  big  in  practice  and  ensures  the  sequence 


{Ok)  to  stay  bounded  as  the  search  proceeds.  For  simplicity,  we  will  assume  that  ©  is  a 
hyper-rectangle  and  takes  the  form  0  =  {0  £  0  :  at  <  0-L  <  bi)  for  constants  a*  <  b{. 
i  =  1 , . . .  ,d]  other  more  general  choices  of  0  may  also  be  used  (see,  e.g.,  Section  4.3  of 
[13]).  Intuitively,  such  a  constraint  set  should  be  chosen  sufficiently  large  in  practice  so 
that  the  limits  of  the  recursion  at  step  4  without  the  projection  are  contained  in  its  interior. 

3.1  Accelerated  GASS 

GASS  can  be  viewed  as  a  stochastic  approximation  (SA)  algorithm,  which  we  will  show  in 
more  details  in  the  next  section.  To  improve  the  convergence  rate  of  SA  algorithms,  [20] 
and  [24]  first  proposed  to  take  the  average  of  the  8  values  generated  by  previous  iterations, 
which  is  often  referred  to  as  Polyak  (or  Polyak-Ruppert)  averaging.  The  original  Polyak 
averaging  technique  is  “offline” ,  i.e.,  the  averages  are  not  fed  back  into  the  iterates  of  0 ,  and 
hence  the  averages  are  not  useful  for  guiding  the  stochastic  search  in  our  context.  However, 
there  is  a  variation,  Polyak  averaging  with  online  feedback  (c.f.  pp.  75  -  76  in  [13]),  which 
is  not  optimal  as  the  original  Polyak  averaging  but  also  enhances  the  convergence  rate  of 
SA.  Using  the  Polyak  averaging  with  online  feedback,  the  parameter  6  will  be  updated 
according  to 

Ok+1  =  n§  | Ok  +  (va ~T0k[T(X)]  +  e/)^  (EPk[T(X)\  -  Egk[T(X)])  +  akc(9k  -  0*)  j  , 


1  k 

=  jz  yi  Qj 

i=i 


which  can  be  calculated  recursively  by 

Ok  = —, — Ok- i  +  ~r-  (13) 

K  K 

With  this  parameter  updating  scheme,  we  propose  the  following  algorithm. 


Algorithm  2  Gradient-based  Adaptive  Stochastic  Search  with  Averaging 
(GASS_avg) 

Same  as  Algorithm  1  except  in  step  4  the  parameter  updating  follows  (12)  and  (13). 


3.2  Derivation 

In  this  subsection,  we  explain  the  rationale  behind  the  updating  scheme  (8).  We  first  derive 
the  expressions  of  the  gradient  and  Hessian  of  L(0:  O')  as  given  below. 
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Proposition  1.  Assume  that  f(x;9)  is  twice  differentiable  on  0  and  that  Vgf(x;9)  and 
X2Bf(x;9)  aue  both  bounded  on  X  for  any  9  e  0.  Then 

VeL(9-9')  =  Eg[Sg'(H(X))Vg  In  f(X;9)\ 

V2eL(9;9')  =  Eg[Sg,(H(X))V2glnf(X;9)] 

+  Ee[Sd'  (H(X))Xg  In  /(X;  9)Xg  In  /(X ;  0)T]. 

Furthermore,  if  f(x;  6)  is  in  an  exponential  family  of  densities  defined  by  (6),  then  the 
above  expressions  reduce  to 

XeL(9;9')  =  Eg[Se\H{X))T{X)]  -  Ee[Sg,(H(X))]Ee[T(X)}, 

X2eL{9-9')  =  Eg[Se/(H(X))(T(X)-Ee[T(XmT(X)-Ee[T(X)})T] 

-  Var g[T(X)]E0[Sgr(H(X))}. 

Notice  that  if  we  were  to  use  Newton’s  method  to  update  the  parameter  9,  the  Hessian 
XBL(9;  9')  is  not  necessarily  negative  semidefinite  to  ensure  the  parameter  updating  is 
along  the  ascent  direction  of  L(9;9'),  so  we  need  some  stabilization  scheme.  One  way 
is  to  approximate  the  Hessian  by  the  second  term  on  the  righthand  side  with  a  small 
perturbation,  i.e.,  —  (Vargi[T(X)]  +  eI)Eg[Sg> (H(X))],  which  is  always  negative  definite. 
Thus,  the  parameter  9  could  be  updated  according  to  the  following  iteration 

9k+ 1  =  9k  +  ak{{X-AYgk[T{X)]  +  eI)Egk[Sgk{H{X))})-lXgL(9k-9k),  (14) 

which  immediately  leads  to  the  updating  scheme  (8)  given  before. 

In  the  updating  equation  (8),  the  term  Egk[Sgk(H (X))]-1  is  absorbed  into  XgL(9 9 ^), 
so  we  obtain  a  scale- free  term  ( EPk  [T(X)]  —  Egk  [T(X)])  that  is  not  subject  to  the  scaling  of 
the  function  value  of  Sgk(H(x)).  It  would  be  nice  to  have  such  a  scale- free  gradient  so  that 
we  can  employ  other  gradient-based  methods  more  easily  besides  the  above  specific  choice 
of  a  quasi-Newton  method.  Towards  this  direction,  we  consider  a  further  transformation 
of  the  maximization  problem  (5)  by  letting 

l{9-9')  =ln  L{9-9'). 

Since  In  :  R+  — >  R  is  a  strictly  increasing  function,  the  maximization  problem  (5)  is 
equivalent  to 

9*  =  argmaxZ($;  9').  (15) 

e&Rd 

The  gradient  and  the  Hessian  of  1(9;  9')  are  given  in  the  following  proposition. 
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Proposition  2.  Assume  that  f(x;9 )  is  twice  differentiable  on  0  and  that  Vgf(x;9)  and 
Vg/(x;0)  are  both  bounded  on  X  for  any  9  e  0.  Then 

VeWtf) \e=o'  =  Epi..ei)[Ve  In  f(X;9')\ 

V^(M') \0=0,  =  Ep^.-g/'j  [V g  In  / (X ;  9')}  +  Varp(..0/)  [V*  ln/(X;  9')]  . 

Furthermore,  if  f(x;  9)  is  in  an  exponential  family  of  densities,  then  the  above  expressions 
reduce  to 


VeW9')\e=ei  =  Ep{..ei)[T(X)}  -  Eg,[T(X)}, 

X2el{9-9')\e=ei  =  Varp(..e0[T(X)]-Var^[T(X)]. 

Similarly  as  before,  noticing  that  the  Hessian  Xgl(9']  9’)  is  not  necessarily  negative  def¬ 
inite  to  ensure  the  parameter  updating  is  along  the  ascent  direction  of  1(9:  9'),  we  approxi¬ 
mate  the  Hessian  by  the  slightly  perturbed  second  term  in  Vgl(9'-,  9'),  i.e. ,  —  (Var0/[T(X)]  + 
el).  Then  by  setting 

0k+ 1  =  Ok  +  ak  (Var ek[T(X)}  +  el)~l  Xgl(9k-  9k), 

we  again  obtain  exactly  the  same  updating  equation  (8)  for  9.  The  difference  from  (14)  is 
that  the  gradient  Xgl(9;  9')  is  a  scale-free  term,  and  hence  can  be  used  in  other  gradient- 
based  methods  with  easier  choices  of  the  step  size.  From  the  algorithmic  viewpoint,  it  is 
better  to  consider  the  optimization  problem  (15)  on  1(9 ;  9')  instead  of  the  problem  (5)  on 
L(9 ;  9'),  even  though  both  have  the  same  global  optima. 

Although  there  are  many  ways  to  determine  the  positive  definite  matrix  in  front  of  the 
gradient  in  a  quasi-Newton  method,  our  choice  of  (Vargfc  [T(X)\  +  el )_1  is  not  arbitrary  but 
based  on  some  principle.  Without  considering  the  numerical  stability  and  thus  dropping 
the  term  el,  the  term  Vare[T(X)]  =  E[Vg  \nf(X;  9)(Vg  lnf(X;  9))1  ]  =  E[— X7g  In  f(X;9)\ 
is  the  Fisher  information  matrix,  whose  inverse  provides  a  lower  bound  on  the  variance  of 
an  unbiased  estimator  of  the  parameter  9  ([21]),  leading  to  the  fact  that  (Var^T^X)])-1 
is  the  minimum- variance  step  size  in  stochastic  approximation  ([16]).  Moreover,  from 
the  optimization  perspective,  the  term  (VargfT^X)])-1  relates  the  gradient  search  on  the 
parameter  space  with  the  stochastic  search  on  the  solution  space,  and  thus  adaptively 
adjusts  the  updating  of  the  sampling  distribution  to  our  belief  about  the  promising  regions 
of  the  solution  space.  To  see  this  more  easily,  let  us  consider  T(X)  =  X.  Then  (Vare[X])_1 
is  smaller  (i.e.,  the  gradient  step  in  updating  9  is  smaller)  when  the  current  sampling 
distribution  is  more  flat,  signifying  the  exploration  of  the  solution  space  is  still  active  and 
we  do  not  have  a  strong  belief  (i.e.  f(-‘,9))  about  promising  regions;  (Vare[X])_1  is  larger 
(i.e.,  the  gradient  step  in  updating  9  is  larger)  when  our  belief  f(-;9)  is  more  focused  on 
some  promising  regions. 
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4.  Convergence  Analysis 

We  will  analyze  the  convergence  properties  of  GASS,  resorting  to  methods  and  results  in 
stochastic  approximation  (e.g.,  [12,  13,  1]).  In  GASS,  Vel(9;9k) \e=ek  is  estimated  by 

Vem-9k)  =  EPk[T(X)}  -  Eek[T(X)}.  (16) 

To  simplify  notations,  we  denote 

Vk  E  Va rek[T(X)]  +  eI,  V*  4  Var gk[T(X)]  +  eI. 

Hence,  the  parameter  updating  iteration  in  GASS  is 

ek+l  =  n§  { ek  +  akv^v ei(ok;  ok)} ,  (17) 

which  can  be  rewritten  in  the  form  of  a  generalized  Robbins-Monro  algorithm 

9k+ 1  =  9k  +  cnk[D{9k)  +  bk  +  £&  +  zk\,  (18) 


where 

D(9k)  =  (Var  gk[T(X)}  +  eiy1  Vel(9k;9k), 
bk  =  V^(EPk[T{X)]-EPk[T(X)]), 

Sk  =  (Vk1  -  n_1)  (EPk[T(X)]  -  Eek[T(X)])  +  V,-1  (EPk[T{X)}  -  EPk[T{X)])  , 

and  zk  is  the  projection  term  satisfying  akzk  =  6k+\  —  9k  —  ak[D(9k)+bk+^k],  the  minimum 
Euclidean  length  vector  that  takes  the  current  iterate  back  onto  the  constraint  set.  The 
term  D(9k )  is  the  gradient  vector  held,  bk  is  the  bias  due  to  the  inexact  evaluation  of  the 
shape  function  in  EPk[T(X)\  (bk  is  zero  if  the  shape  function  can  be  evaluated  exactly), 
and  £k  is  the  noise  term  due  to  Monte  Carlo  sampling  in  the  approximations  Var gk[T(X)\ 
and  Epk[T(X)}. 

For  a  given  9  £  0,  we  define  a  set  C{9)  as  follows:  if  9  lies  in  the  interior  of  0,  let 
C{9 )  =  {0};  if  9  lies  on  the  boundary  of  0,  define  C(9)  as  the  infinite  convex  cone  generated 
by  the  outer  normals  at  9  of  the  faces  on  which  9  lies  ([13]  pp.  106).  The  difference  equation 
(18)  can  be  viewed  as  a  noisy  discretization  of  the  constrained  ordinary  differential  equation 
(ODE) 

9t  =  D{9t)  +  zt,  zt  £  —C(9t),  t  >  0,  (19) 

where  zt  is  the  minimum  force  needed  to  keep  the  trajectory  of  the  ODE  in  0.  Thus, 
the  sequence  of  {9k}  generated  by  (18)  can  be  shown  to  asymptotically  approach  the 
solution  set  of  the  above  ODE  (19)  by  using  the  well-known  ODE  method.  Let  ||  •  || 
denote  the  vector  supremum  norm  (i.e.,  [|x||  =  max{|xj|})  or  the  matrix  max  norm  (i.e., 
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||A||  =  max{|ajj|}).  Let  ||  •  || 2  denote  the  vector  2-norm  (i.e.,  ||x||  =  \fx\  +  . . .  +  x%)  or  the 
matrix  norm  induced  by  the  vector  2-norm  (also  called  spectral  norm  for  a  square  matrix, 
i.e.,  \\A\\2  =  \J  Xmax(A*A),  where  A*  is  the  conjugate  transpose  of  A  and  A  max  returns  the 
largest  eigenvalue). 

To  proceed  to  the  formal  analysis,  we  introduce  the  following  notations  and  assump¬ 
tions.  We  denote  the  sequence  of  increasing  sigma-fields  generated  by  all  the  samples  up 
to  the  kth  iteration  by 

{*»  =  *  ({*$>£..  Ui}£, . (4}&)  = 


Dehne  notations 


Ufe  :  ¥fc  := 

i=  1 
Nk 

Ufc  :=  WY/Sok(H(xi))T(xi),  Yk  := 

i=  1 


4)) 

1=1 

Nk 

JrY,S0k(H(xl)) 

1=1 

Edk[Sek(H(X))}. 


Assumption  1. 

(i)  The  step  size  sequence  {ak}  satisfies  ak  >  0  for  all  k,  ak  \  0  as  k  — >  00,  and 

EOO 

k= 0  ak  ~  00  • 

(ii)  The  sample  size  Nk  =  N^k^ ,  where  f  >  0;  moreover,  {a^}  and  {IV^}  jointly  satisfies 
=  O(k-P)  for  some  constant  /3  >  1. 

(in)  The  function  x  i->-  T(x)  is  bounded  on  X . 

(iv)  For  any  x,  \Sdk(H(x ))  -  Sek(H(x))\  ->•  0  w.p.l  as  Nk  00. 

In  the  above  assumption,  (i)  is  a  typical  assumption  on  the  step  size  sequence  in  SA, 
which  means  that  ak  diminishes  not  too  fast.  Assumption  1  (ii)  provides  a  guideline  on 
how  to  choose  the  sample  size  given  a  choice  of  the  step  size  sequence,  and  shows  that  the 
sample  size  has  to  increase  to  infinity  no  slower  than  a  certain  speed.  For  example,  if  we 
choose  ak  =  aok~a  with  0  <  a  <  1,  then  it  is  sufficient  to  choose  Nk  =  0(k2^~a^).  As¬ 
sumption  l(iii)  holds  true  for  many  exponential  families  used  in  practice.  Assumption  l(iv) 
is  a  sufficient  condition  to  ensure  the  strong  consistency  of  estimates,  and  is  satisfied  by 
many  choices  of  the  shape  function  Sq.  For  example,  it  is  trivially  satisfied  if  Sg  =  S,  since 
S(H(x))  can  be  evaluated  exactly  for  each  x.  If  Sg  takes  the  form  of  (4),  Assumption  l(iv) 
is  also  satisfied,  as  shown  in  the  following  lemma. 

Lemma  1 .  Suppose  the  shape  function  takes  the  form 

Sgk(H{x))  =  (H(x)  -  Hih)  1  +  es0(H{x)-7ek)’ 
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where  jgk  =  sup;  {l  :  Pgk{x  G  X  :  H(x)  >  1}  >  p}  is  the  unique  (1  —  p)- quantile  with  respect 
to  /(•;  Ok)-  If  Sek{H(x))  is  estimated  by  Sgk(H(x ))  with  the  true  quantile  ^gk  being  replaced 
by  the  sample  (1  —  p)-quantile  'ygk  =  m_p)jvfc])5  where  [a]  is  the  smallest  integer  greater 
than  a,  and  is  the  ith  order  statistic  of  the  sequence  {H(x\),  i  =  1, . . . ,  A7. } .  Then  un¬ 
der  the  condition  Nk  =  Q(k^)  (  >  0,  we  have  that  for  every  x,  | S$k(H(x))  —  Sgk(H(x))\  — >• 
0  w.p.l  as  k  — >  00. 


The  next  lemma  shows  that  the  summed  tail  error  goes  to  zero  w.p.l. 
Lemma  2.  Under  Assumption  1  (i)-(iii),  for  any  T  >  0, 


lim 

k — ^00 


sup 

>:0<E  Zfai<T} 


]  ati£i 


i=k 


=  0, 


w.p.l. 


Theorem  1  below  shows  that  GASS  generates  a  sequence  {Ok}  that  asymptotically 
approaches  the  limiting  solution  of  the  ODE  (19)  under  the  regularity  conditions  specified 
in  Assumption  1. 


Theorem  1.  Assume  that  D(0t)  is  continuous  with  a  unique  integral  curve  (i.e.,  the  ODE 
(19)  has  a  unique  solution  Oft) )  and  Assumption  1  holds.  Then  the  sequence  {Ok}  generated 
by  (17)  converges  to  a  limit  set  of  (19)  w.p.l.  Furthermore,  if  the  limit  sets  of  (19)  are 
isolated  equilibrium  points,  then  w.p.l  {Ok}  converges  to  a  unique  equilibrium  point. 

For  a  given  distribution  family,  Theorem  1  shows  that  our  algorithm  will  identify  a 
local/global  optimal  sampling  distribution  within  the  given  family  that  provides  the  best 
capability  in  generating  an  optimal  solution  to  (1).  From  the  viewpoint  of  maximizing 
Eq[H(X)\,  the  average  function  value  under  our  belief  of  where  promising  solutions  are 
located  (i.e.,  the  parameterized  distribution  f(x,0)),  the  convergence  of  the  algorithm  to 
an  local/global  optimum  in  the  parameter  space  essentially  gives  us  a  local/global  optimum 
of  our  belief  about  the  function  value. 


4.1  Asymptotic  Normality  of  GASS 

In  this  section,  we  study  the  asymptotic  convergence  rate  of  Algorithm  1  under  the  assump¬ 
tion  that  the  parameter  sequence  {9k}  converges  to  a  unique  equilibrium  point  0*  of  the 
ODE  (19)  in  the  interior  of  0.  This  indicates  that  there  exists  a  small  open  neighborhood 
A f(0*)  of  0*  such  that  the  sequence  {Ok}  will  be  contained  in  A f(0*)  for  k  sufficiently  large 
w.p.l.  Thus,  the  projection  operator  in  (17)  and  Zk  in  (18)  can  be  dropped  in  the  analysis, 
because  the  projected  recursion  will  behave  identically  to  an  unconstrained  algorithm  in 
the  long  run.  Define  C(0)  =  Ve>l{0'  and  let  Jc  be  the  Jacobian  of  C.  Under  our 

conditions,  it  immediately  follows  from  (19)  that  C(0*)  =  {0}  and  C(0*)  =  0.  Since  C  is 
the  gradient  of  some  underlying  function  F(0),  Jc  is  the  Hessian  of  F  and  Algorithm  1  is 
essentially  a  gradient-based  algorithm  for  maximizing  F[Q).  Therefore,  it  is  reasonable  to 
expect  that  the  following  assumption  holds: 
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Assumption  2.  The  Hessian  matrix  Jc{9)  is  continuous  and  symmetric  negative  definite 
in  the  neighborhood  Af  (6*)  of  9* . 

We  consider  a  standard  gain  sequence  ak  =  ao/ka  for  constants  «o  >  0  and  0  <  a  <  1, 
a  polynomially  increasing  sample  size  Nk  =  Nohfi  with  Nq  >  1  and  f  >  0. 

By  dropping  the  projection  operator  in  (17),  we  can  rewrite  the  equation  in  the  form: 

4+i  =  4  +  k~a<S>kC(9k)  +  k~a$k(^  - 

Vfc/ 

where  4  =  4  —  0*  and  <4  =  ao(Vargfe(T(A))  +  e/)-1.  Next,  by  using  a  first  order  Taylor 
expansion  of  C{0k)  around  the  neighborhood  of  9*  and  the  fact  that  C{9*)  =  0,  we  have 

4+i  =  4  +  k-a*kJc(6k)8k  +  k-a$k(^-  - 

vvfc  Vfc/ 

where  9k  lies  on  the  line  segment  from  9k  to  6*.  For  a  given  positive  constant  r  >  0,  the 
above  equation  can  be  further  written  in  the  form  of  a  recursion  in  [5]: 

4+1  =  (I  ~  k~aTk)5k  +  k~^/^kWk  +  k~a~^2Tk, 

where  Tfc  =  -<44(4),  Wk  =  k^l 2(|j  -  Egk  [fj|4_i]),  and  Tk  =  W2 <4(|  -  + 

Eok  [4  | -74,—  i ]  —  ^).  The  basic  idea  of  the  rate  analysis  is  to  show  that  the  sequence 

of  amplified  differences  {kT^25k}  converges  in  distribution  to  a  normal  random  variable 
with  mean  zero  and  constant  covariance  matrix.  To  this  end,  we  show  that  all  sufficient 
conditions  in  Theorem  2.2  in  [5]  are  satisfied  in  our  setting.  We  begin  with  a  strengthened 
version  of  Assumption  1  (iv). 

Assumption  3. 

For  a  given  constant  r  >  0  and  x  G  X,  kT^2\Sgk(H(x))  —  Sgk(H(x))\  — >  0  as  k  — >  oo  w.p.l. 

Assumption  3  holds  trivially  when  S$  is  a  deterministic  function  that  is  independent 
of  9.  In  addition,  if  sample  quantiles  are  involved  in  the  shape  function  and  Sgk(H(x)) 
takes  the  form  (4),  then  the  assumption  can  also  be  justified  under  some  additional  mild 
regularity  conditions;  cf.  e.g.,  [9]. 

Let  =  ao('Vaxg*(T(X))  +  e/)_1  and  T  =  —  <&Jc(9*).  The  following  result  shows 
condition  (2.2.1)  in  Theorem  2.2  of  [5]. 

Lemma  3.  Assume  Assumptions  1  and  2  hold,  we  have  4  — >■  3>  and  4  -+  T  as  k  —*■  oo 
w.p.l.  In  addition,  if  Assumption  l(iv)  is  replaced  with  Assumption  3  and  Nk  =  N^k^  with 
(  >  r/2,  then  Tk  — >  0  as  k  -+  oo  w.p.l. 

In  addition,  the  noise  term  Wk  has  the  following  property,  which  justifies  condition 
2.2.2  in  [5]. 
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Lemma  4.  Egk  [Wkl^k-i]  =  0-  In  addition,  let  r  be  a  given  constant  satisfying  r  >  a.  If 
Assumption  1  holds  and  Nk  =  NokT~a,  then  there  exists  a  positive  semi-definite  matrix  X 
such  that  limfe_>oo  Eok\WkW];\Fk-\\  =  X  w.p.l,  and  lim^oo  E[I{\\Wk\\2  >  rka}\\ Wk\\2]  = 
0  Vr  >  0. 

The  following  asymptotic  normality  results  then  follows  directly  from  Theorem  2.2  in 
[5], 

Theorem  2.  Let  ak  =  cto/ka  for  0  <  a  <  1.  For  a  given  constant  r  >  2a,  let  Nk  = 
N()kT~a).  Assume  the  convergence  of  the  sequence  {6k}  occurs  to  a  unique  equilibrium 
point  9*  w.p.l.  If  Assumptions  1,  2,  and  3  hold,  then 

k^(9k  -  6*)  -^4  N(0,QMQt), 

where  Q  is  an  orthogonal  matrix  such  that  QT  (— Jc{0*))Q  =  A  with  A  being  a  diagonal 
matrix,  and  the  ( i,j)th  entry  of  the  matrix  M.  is  given  by  Mn.  n  =  (QT Q)uj\(ku^  + 

Theorem  2  shows  the  asymptotic  rate  at  which  the  noise  caused  by  Monte-Carlo  random 
sampling  in  GASS  will  be  damped  out  as  the  number  of  iterations  k  — >  oo.  This  rate,  as 
indicated  in  the  theorem,  is  on  the  order  of  0(1/ \/W).  This  implies  that  the  noise  can  be 
damped  out  arbitrarily  fast  by  using  a  sample  size  sequence  {Nk}  that  increases  sufficiently 
fast  as  k  — >  oo.  However,  we  note  that  this  rate  result  is  stated  in  terms  of  the  number  of 
iterations  k,  not  the  sample  size  Nk-  Therefore,  in  practice,  there  is  the  need  to  carefully 
balance  the  tradeoff  between  the  choice  of  large  values  of  Nk  to  increase  the  algorithms’s 
asymptotic  rate  and  the  use  of  small  values  of  Nk  to  reduce  the  per  iteration  computational 
cost. 


5.  Numerical  Experiments 


We  test  the  proposed  algorithms  GASS,  GASS_avg  on  some  benchmark  continuous  opti¬ 
mization  problems  selected  from  [8]  and  [9].  To  fit  in  the  maximization  framework  where 
our  algorithms  are  proposed,  we  take  the  negative  of  those  objective  functions  that  are 
originally  for  minimization  problems.  The  ten  benchmark  problems  are  listed  as  below. 


(1)  Dejong’s  5th  function  (n=2,  —50  <  x%  <  50) 


H i  (x  )  =  - 


25 


0.002  +  ~ 


J  +  £i=i(**-o 


'jij 


-i 


where  an  =  (-32,  -16, 0, 16,  32,  -32,  -16,  0, 16, 32,  -32,  -16,  0, 16,  32,  -32,  -16,  0, 16, 
32,  -32,  -16, 0, 16,  32)  and  aj2  =  (-32,  -32,  -32,  -32,  -32,  -16,  -16,  -16,  -16,  -16,  0 
0,  0,  0, 0, 16, 16, 16, 16, 16,  32,  32, 32,  32,  32).  The  global  optimum  is  at  x*  =  (—32,  — 32)T 
and  H*  «  -0.998. 
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(2)  Shekel’s  function  (n=4,  0  <  x*  <  10  ) 

5 

H2(x)  =  ((s  -  aj)T(x  -  dj)  +  Cj)  , 

i=  1 

where  a\  =  (4, 4, 4, 4)T,  a2  =  (1,1,1,1)T,  a3  =  (8,8,8,8)T,  a4  =  (6,6,6,6): 
(3,  7, 3,  7)r,  and  c  =  (0.1, 0.2,  0.2, 0.4,  0.4).  x*  =  (4, 4, 4, 4)T,  H*  «  10.153. 

(3)  Powel  singular  function  (n=50,  —50  <  <  50) 

n—  2 

H3(x)  =  -^  [(xj_i  +  10xi)2  +  5(xj_|_i  -  xi+2)2  +  (xi  -  2xi+i)4  +  10(xj_i  - 

i=2 

where  x*  =  (0,  •  •  •  ,0)T,  77*  =  —1. 

(4)  Rosenbrock  function  (n=10,  —10  <  Xj  <  10) 


n— 1 

i74(.T)  =  -  ^  [l00(xi+i  -  x2)2  +  (x*  -  l)2]  -  1, 

7= 1 

where  x*  =  (1,  •  •  •  ,  1)T,  77*  =  —1. 

(5)  Griewank  function  (n=50,  —50  <  x$  <  50) 


^s(x) 


1  /t.  /«- 

—  — ~ —  x2  +  TT  cos 

4000  ^  1 


-1, 


where  x*  =  (0,  •  •  •  ,  0)T,  77*  =  0. 

(6)  Trigonometric  function  (n=50,  —50  <  Xj  <  50) 

n 

Hq(x)  =  —  ^  [8 sin2(7(x,  —  0.9)2)  +  6 sin2(14(xj  —  0.9)2)  +  (x*  —  0.9)2] 

where  x*  =  (0.9,  •  •  •  ,  0.9)T,  77*  =  —1. 

(7)  Rastrigin  function  (n=20,  —5.12  <  Xi  <  5.12) 

n 

H-j(x)  =  —  ^2  {X1  ~  10cos(27TXj))  —  lOro  —  1, 

7=1 

where  x*  =  (0,  •  •  •  ,0)T,  77*  =  —  1. 


»  «5  — 


x®+2)4]  —  1, 
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(8)  Pinter’s  function  (n=50,  —50  <  x,  <  50) 

n  n 

J2ixi+J2  20isin2(xi_i  sin  xt  —  x%  +  sinrcj+i) 
i=  1  i=l 

n 

+  ^  i  log10(l  +  -  2 x*  +  3xi+\  -  cos  x*  +  l)2) 

i= 1 

where  x*  =  (0,  •  •  •  ,  0)T,  H*  =  —1. 

(9)  Levy  function  (n=50,  —50  <  Xi  <  50) 

n—  1 

H9(x)  =  -sin2(7ryi)-^  [(yi  -  1)2(1  +  10  sin2  (717/*  +  1))] -(yn-l)2(l+10 sin2(27ryn))-l, 
i= 1 

where  yt  =  1  +  (xt  -  l)/4,  x*  =  (1,  •  •  ■  ,  1)T,  iL*  =  -1. 

(10)  Weighted  Sphere  function  (n=50,  —50  <  x*  <  50) 

n 

Hio(x)  =  ixf  -  1 

where  x*  =  (0,  •  •  •  ,  0)T,  iL*  =  —1. 

Specihcally,  Dejong’s  5th  (fli)  and  Shekel’s  (H2)  are  low-dimensional  problems  with  a  small 
number  of  local  optima  that  are  scattered  and  far  from  each  other;  Powel  (H3)  and  Rosen- 
brock  (H4)  are  badly-scaled  functions;  Griewank  (H, 5),  Trigonometric  {Hq),  and  Rastrigin 
(Hj)  are  high-dimensional  multimodal  problems  with  a  large  number  of  local  optima,  and 
the  number  of  local  optima  increases  exponentially  with  the  problem  dimension;  Pinter 
(Hs)  and  Levy  (Hg)  are  both  multimodal  and  badly-scaled  problems;  Weighted  Sphere 
function  (Rio)  is  a  high-dimensional  concave  function. 

We  compare  the  performance  of  GASS  and  GASS_avg  with  two  other  algorithms:  the 
modified  version  of  the  CE  method  based  on  stochastic  approximation  proposed  by  [9]  and 
the  MRAS  method  proposed  by  [8].  In  our  comparison,  we  try  to  use  the  same  parameter 
setting  in  all  four  methods.  The  common  parameters  in  all  four  methods  are  set  as  follows: 
the  quantile  parameter  is  set  to  be  p  =  0.02  for  low-dimensional  problems  H\  and  H2 , 
and  p  =  0.05  for  all  the  other  problems;  the  parameterized  exponential  family  distribution 
/(x;  6k)  is  chosen  to  be  independent  multivariate  normal  distribution  Life);  the  initial 

mean  //q  is  chosen  randomly  according  to  the  uniform  distribution  on  [—30,30]™,  and  the 
initial  covariance  matrix  is  set  to  be  So  =  1000/nxnj  where  n  is  the  dimension  of  the 
problem;  the  sample  size  at  each  iteration  is  set  to  be  =  1000.  In  addition,  we  observe 
that  the  performance  of  the  algorithm  is  insensitive  to  the  initial  candidate  solutions  if  the 
initial  variance  is  large  enough. 


H$(x)  =  - 
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In  GASS  and  GASS_avg,  we  consider  the  shape  function  of  the  form  (4),  i.e. 


Set(H(x))  =  («(,)  -  g|>)1  +  e_So^w_7,t), 

In  our  experiment,  So  is  set  to  be  105 ,  which  makes  Sgk(H(x))  a  very  close  approximation 
to  ( H(x )  —  Ha,)I{H(x)  >  7efc};  the  (1  —  p)-quantile  'yek  is  estimated  by  the  (1  —  p)  sample 
quantile  of  the  function  values  corresponding  to  all  the  candidate  solutions  generated  at 
the  kth  iteration.  We  use  the  step  size:  a*,  =  ao/ka,  where  ao  reflects  the  initial  step  size, 
and  the  parameter  a  should  be  between  0  and  1.  We  set  «o  =  0.3  for  the  low-dimensional 
problems  H\  and  and  the  badly-scaled  problem  H&,  and  set  ao  =  1  for  the  rest  of 
the  problems;  we  set  a  =  0.05,  which  is  chosen  to  be  relatively  small  to  provide  a  slowly 
decaying  step  size.  With  the  above  setting  of  step  size,  we  can  always  find  a  f3  such  that  the 
sample  size  Nk  =  1000  satisfies  the  Assumption  1  (ii)  under  a  finite  number  of  iterations, 
e.g.  k  <  2500  in  our  experiment.  In  GASS_avg,  the  feedback  weight  is  c  =  0.002  for 
problems  H3,  H 4  and  H$  and  c  =  0.1  for  all  other  problems. 

In  the  modified  CE  method,  we  use  the  gain  sequence  a =  5 /(k  +  lOO)0'501,  which  is 
found  to  work  best  in  the  experiments.  In  the  implementation  of  MRAS  method,  we  use  a 
smoothing  parameter  v  when  updating  the  parameter  9k  of  the  parameterized  distribution, 
and  set  v  =  0.2  as  suggested  by  [8].  The  rest  of  the  parameter  setting  for  MRAS  is  as 
follows:  A  =  0.01,  r  =  10-4  in  the  shape  function  S(H(x))  =  exp{rH(x)}.  Other  than 
using  an  increasing  sample  size  in  [9]  and  [8],  and  updating  quantile  pk  in  [8],  the  constant 
sample  size  N  =  1000  and  a  constant  p  are  used  in  our  experiments  for  a  fair  comparison 
of  all  the  methods. 


GASS 

GASS_avg 

modified  CE 

MRAS  j 

H* 

H * ( std.err ) 

Me 

H*  (std_err) 

Me 

H*  (std_err) 

Ms 

H*  (std.err) 

Me 

Dejong’s  5th  H\ 

-0.998 

-0. 998(4. 79E-7) 

100 

-0. 998(8. 97E-7) 

100 

-1.02(0.014) 

95 

-0. 9981(6. 63E-4) 

98 

Shekel  H2 

10.153 

9.92(0.114) 

96 

9.91(0.106) 

95 

10.153(1. 09E-7) 

79 

9.90(0.126) 

96 

Powel  Hg 

-1 

-l(1.48E-6) 

100 

-l(1.89E-6) 

100 

-l(8.87E-9) 

100 

-1.50(0.433) 

95 

Rosenbrock  H4 

-1 

-1.03(1.40E-4) 

0 

-1.09(0.0301) 

46 

-1.91(0.016) 

0 

-7.10(0.629) 

0 

Griewank  H$ 

0 

0(8.45E-15) 

100 

0(7.30E-15) 

100 

-0(3.02E-16) 

100 

-0.14(0.017) 

57 

Trigonometric  Hq 

-1 

-1(9.72E-13) 

100 

-1(1.08E-12) 

100 

-1(2.23E-18) 

100 

-l(4.69E-7) 

100 

Rastrigin  H7 

-1 

-1.15(0.0357) 

85 

-1.19(0.044) 

83 

-1.01(0.0099) 

99 

-83.45(0.634) 

0 

Pinter  Hg 

-1 

-1.007(0.0034) 

93 

-1.04(0.0104) 

63 

-6.08(0.0254) 

0 

-530.4(48.64) 

2 

Levy  Hq 

-1 

-1(9.56E-13) 

100 

-l(1.29E-7) 

100 

-1. 063(3. 87E-18) 

100 

-1(1.42E-10) 

100 

Sphere  H 10 

-1 

-1(1.79E-11) 

100 

-1(1.42E-11) 

100 

-1(2.23E-18) 

100 

-l(9.95E-9) 

100 

Table  1:  Comparison  of  GASS,  GASS_avg,  modified  CE  and  MRAS 


In  the  experiments,  we  found  the  computation  time  of  function  evaluations  dominates 
the  time  of  other  steps,  so  we  compare  the  performance  of  the  algorithms  with  respect  to 
the  total  number  of  function  evaluations,  which  is  equal  to  the  total  number  of  samples. 
The  average  performance  based  on  100  independent  runs  for  each  method  is  shown  in  Table 
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1,  where  H*  is  the  true  optimal  value  of  Ff(-);  H*  is  the  average  of  the  function  values 
returned  by  the  100  runs  of  an  algorithm;  std-err  is  the  standard  error  of  these  100  function 
values;  Me  is  the  number  of  e-optimal  solutions  out  of  100  runs  (e-optimal  solution  is  the 
solution  such  that  H*  —  H*  <  e,  where  H*  is  the  optimal  function  value  returned  by  an 
algorithm).  We  consider  e  =  10~2  for  problems  H4,  H7.  Hg  and  e  =  10-3  for  all  other 
problems.  Fig.  1  and  Fig.  2  show  the  average  (over  100  runs)  of  best  value  of  H(-)  at  the 
current  iteration  versus  the  total  number  of  samples  generated  so  far. 

From  the  results,  GASS  and  GASS.avg  find  all  the  e-optimal  solutions  in  100  runs  for 
problems  H\ ,  Hg,  Hq,  Hq,  Hg,  and  #10.  Modified  CE  finds  all  the  e-optimal  solutions  for 
problems  Hg,  Hq,  Hq,  Hg,  and  H\g.  MRAS  only  finds  all  the  e-optimal  solutions  for  the 
problems  Hq  and  Hg  and  the  convex  problem  H\g.  As  for  the  convergence  rate,  GASS_avg 
always  converges  faster  than  GASS,  verifying  the  effectiveness  of  averaging  with  online 
feedback.  Both  GASS  and  GASS_avg  converge  faster  than  MRAS  on  all  the  problems,  and 
converge  faster  than  the  modified  CE  method  when  ao  is  set  to  be  large,  i.e.  on  problems 
Hq  and  Hq  -  H\g. 


6.  Conclusion 

In  this  paper,  we  have  introduced  a  new  model-based  stochastic  search  algorithm  for  solv¬ 
ing  general  black-box  optimization  problems.  The  algorithm  generates  candidate  solutions 
from  a  parameterized  sampling  distribution  over  the  feasible  region,  and  uses  a  quasi- 
Newton  like  iteration  on  the  parameter  space  of  the  parameterized  distribution  to  find 
improved  sampling  distributions.  Thus,  the  algorithm  enjoys  the  fast  convergence  speed 
of  classical  gradient  search  methods  while  retaining  the  robustness  feature  of  model-based 
methods.  By  formulating  the  algorithm  iteration  into  the  form  of  a  generalized  stochastic 
approximation  recursion,  we  have  established  the  convergence  and  convergence  rate  re¬ 
sults  of  the  algorithm.  Our  numerical  results  indicate  that  the  algorithm  shows  promising 
performance  as  compared  with  some  of  the  existing  approaches. 


A.  Appendix 

Proof.  Proof  of  Proposition  1.  Consider  the  gradient  of  L(9]9')  with  respect  to  9, 
VeL(M')  =  J  Se'(H(x))Vef(x;9)dx 

=  J  Sgr(H(x))f  (x;0)V e  In  f(x;9)dx 

=  Ee[Se'(H(X))Ve\nf(X-9)],  (20) 

where  the  interchange  of  integral  and  derivative  in  the  first  equality  follows  from  the  bound¬ 
edness  assumptions  on  Sg>  and  Vef(x'9)  and  the  dominated  convergence  theorem. 
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Consider  the  Hessian  of  L{9]  9')  with  respect  to  9, 

Vim#)  =  J  Sg> (H (x))Vgf  (x;  9)dx 

=  j  Sg>(H(x))f(x-,9)Vglnf(x;9)dx  +  J  Sg>(H(x))Vg  ln/(x;  9)Vgf(x]  9)T dx 

=  Eg[Sgi(H(X))Vg  Infix-  0)]  +  Eg[Sg,(H(X))Vg  In  fix-,  9)V  g  Infix-  £)r](21) 

where  the  last  equality  follows  from  the  fact  that  Vgfix]9)  =  fix]  9)V  g  In /(x;  9). 
Furthermore,  if  fix]  9)  =  expjfP  T(x)  —  4>i9)},  we  have 


Vglnfix]9)  =  Vg  ^0tT(x)  —  In  J  exp  (#7r(x))dx^ 

f  exp(0TT{x))T(x)dx 
f  exp i9TTix))dx 

=  Tix)  -  Eg[TiX)\.  (22) 

Plugging  (22)  into  (20)  yields 

VeLi9]0')  =  Eg[S0fHiX))TiX)\  -  Ee[SefHiX))\Ee[TiX)}. 

Differentiating  (22)  with  respect  to  9,  we  obtain 

V2lnffr-n  -  /exp  (6>rT(x))T(x)T(x)Tdx 

Velnf(x,9)  - 

f  exp(#TT(x))T(x)dx  (J  exp(#TT(x))T(x)dx)7 
(/ exp(#TT(x))<ix)" 

=  -Ee[TiX)TiX)T]  +  Eg[TiX)]Ee[TiX)}T 
=  -Var  g[TiX)].  (23) 

Plugging  (22)  and  (23)  into  (21)  yields 

VlmO')  =  Eg[SgfHiX))iTiX)-Eg[TiX)})iTiX)-Eg[TiX)})T] 

-  Var e[TiX)\Eg[Se#HiX))]. 

□ 


Proof.  Proof  of  Proposition  2.  Consider  the  gradient  of  li9]9')  with  respect  to  9, 


Veli9]9%=e , 


VeLj9]0') 

9')  Q_QI 

J  Sg#Hix))fix]9)Vg  In  fix]9)dx 
Li9] 9') 

Ep^-g^iV g  InfiX]  9')\. 


(24) 
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Differentiating  (24)  with  respect  to  0 ,  we  obtain  the  Hessian 


V^(M') \e=e> 


f  Se/(H(x))f(x;  0)V2  lnf(x ;  0)dx  f  Sg,(H(x))Vg  ln/(x;  0)(Vfl/(z;  0))rdx 

L(0;0')  +  L(0;0') 

(f  S0,(H(x))f(x;  0)Vg  ln/(s;  0)d*)(V0L(0;  0'))T 

£(M')2  *=* 


Using  \7gf(x ;  0)  =  f(x:  0)\7g  ln/(x;  0)  in  the  second  term  on  the  righthand  side,  the  above 
expression  can  be  written  as 


ViZ(M%=0'  =  %;*)[Vglii/(X;0')]  +  £p(,*')  [V*ln/(X;  0')(V^  ln/(X;  0'))T] 

-  Ep(..0l)  [Ve\n f{X- 0')]  Ep{.,ei)  [Vein fiX ;0')]T 
=  Ep{..0l)  [V2e  In  /(X;  0’)}  +  Vaxp(,e0  [V0  In  /(X;  0’)\  .  (25) 

Furthermore,  if  fix;0)  =  exp {0tT(x)  —  0(0)},  plugging  (22)  into  (24)  yields 


Veli0;0') \e=o’  =  Ep{..e>)[TiX)}  -  Egi [T(X)J, 
and  plugging  (22)  and  (23)  into  (25)  yields 

X2eli0-0')\e=e,  =  Varp(.;0,)[T(X)]  -  Vare,[T(X)]. 


□ 

Proof.  Proof  of  Lemma  1.  Because  S$  is  continuous  in  70,  it  is  sufficient  to  show  that 
fgk  — >  7 gk  w.p.l  as  k  — >  oo,  which  can  be  shown  in  the  same  way  as  Lemma  7  in  [8],  except 
that  we  need  to  verify  the  following  condition  in  their  proof: 

OO 

exp  <  oo, 

fc=i 


where  M  is  positive  constant.  It  is  easy  to  see  that  this  condition  is  trivially  satisfied  in 
our  setting  by  taking  =  XoAA  with  £  >  0.  □ 

Proof.  Proof  of  Lemma  2. 

Before  the  formal  proof  of  Lemma  2,  we  first  introduce  a  key  inequality  to  our  proof  -  the 
matrix  bounded  differences  inequality  ([26]),  which  is  a  matrix  version  of  the  generalized 
Hoeffding  inequality  (i.e.,  McDiarmid’s  inequality  ([15])).  Let  Amax(-)  and  Amjn(-)  return 
the  largest  and  smallest  eigenvalue  of  a  matrix,  respectively. 
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Theorem  3.  (Matrix  bounded  differences,  Corollary  7.5,  [26])  Let  {X1  :  i  =  1,2,...  ,N}  be 
an  independent  family  of  random  variables,  and  let  V  be  a  function  that  maps  N  variables 
to  a  self-adjoint  matrix  of  dimension  d.  Consider  a  sequence  of  {Ck}  of  fixed  self-adjoint 
matrices  that  satisfy 

(V(x1,.i.,x',..-,.,xN)-V(x1,...,xt,...,xN))2  <  Cf, 


where  xl  and  xl  range  over  all  possible  values  of  X1  for  each  index  i.  Compute  the  variance 
parameter 


a 


2  . 


Then,  for  all  5  >  0, 


-P{Amax(V(x)  -  E[V(x)])  >  (5}  <  dexp 


where  x  =  (X1, . . . ,  XN). 


Now  we  proceed  to  the  formal  proof  of  Lemma  2.  Recall  that  fk  can  be  written  as 
&  =  (Vk1  -  V^)(EPk[T(X)\  -  E0k[T(X )])  +  V-\EPk[T{X)}  -  EPk[T(X)]).  (26) 


To  bound  the  first  term  on  the  right-hand-side  in  (26),  we  notice  that  since  V.  1  and  V,  1 
are  both  positive  definite  and  (e~1I—Vf~1)  and  (e~1I—Vl71)  are  both  positive  semi-definite, 
we  have 

K~1-*T1II  =  II  y^iVk-Vk)^1]] 

<  Il^1lll|t4-14||||^-1|| 

<  e~2\\Vk-Vk\\.  (27) 

To  establish  a  bound  on  ||14  —  14 1|,  we  use  the  matrix  bounded  differences  inequality  that 
is  introduced  above.  For  simplicity  of  exposition,  we  drop  the  subscript  k  in  the  expression 
below. 


sup 

x1 


)  1  *  *  *  1  ^ 


{v(xl,..,,x\...,xN)  -  R(x\. 

=  ^2  sup  |  [T(xi)T(xi)T  -  T(x*)T(x*)t]  -  J2  {T(xi)  -  T(x®))  T(x^t. 


x1  ,xl£X 

i 


N  - 


I^r^)(r(xi)-T(xi))J 
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where  C  is  a  fixed  positive  semidefinite  matrix.  This  last  inequality  is  due  to  Assump¬ 
tion  l(iv)  that  T(x )  is  bounded  on  X.  Note  that  conditioning  on  Ek-  1,  { x i  =  1, . . . ,  TV k} 
are  i.i.d.,  and  Egk  [14|  J4_i]  =  14.  Then  according  to  the  matrix  bounded  differences  in¬ 
equality,  for  all  5  >  0, 

p  {Amax(Vfc  -  14)  >  6  |J4_i|  <  dexp  ^g||^|^2  ^ 

which  also  implies 

P{- Amin(V5t  ~Vk)>5  l-Ffc-r}  =  P  { Amax(Vfc  -  Vk)  >  5  |^_i}  <  dexp  • 

Recall  that  for  a  symmetric  matrix  A,  ||A||2  =  max(Amax(A),  —Amin (A))  and  ||A||  <  ||A||2. 
Hence, 

P{\\Vk~Vk\\>6\Pk_1}<p[\\Vk-Vk\\2>S\Pk_1}<2deW(^^y 

Recall  that  for  any  nonnegative  random  variable  X , 

/»oo 

E[X]  =  /  P(X  >  x)dx 

Jo 

P  (. X  >  x)  dx. 

So  we  have 

E  [\\Vk  -  Vk\\2  \Ek-i]  <  a  +  J°°  p[\\Vk  -  14||  >  Vx  | Pk}  dx 

Set  a  =  8 IICH 2  log  (2d)/Nk,  and  we  obtain 

E  \\\Vk  -  14||  iJ-fc-il2  <  E  \\\Vk  -  14|| 2  <  8HC,ll2(1  +  1°g(2d)),  (28) 

L  J  L  J  iVfc 

To  bound  the  second  term  in  the  right-hand-side  of  (26),  notice  that  EPk[Tj(X)\  is 
a  self-normalized  importance  sampling  estimator  of  EPk[Tj(X)\,  where  Tj(X)  is  the  jth 
element  in  the  vector  T(X).  Applying  Theorem  9.1.10  (pp.  294,  [2]),  we  have 

E^E^X)}-  E^X^lEk^]  j  =  l  ,...,d, 
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where  Cj's  are  positive  constants  due  to  the  boundedness  of  Tj(x)  on  X.  Hence, 


E 


\EPk[T(X)\-EPk[T(X)]\\\Ek_, 


1  2 


<  E  \\EPk[T{X)}-EPk[T{X)]r\^ 


<  Y,E  ~  EPk[Tj(X)]\2\Fk-i 

3= 1 


< 


d  max,  c 


iVfc 


(29) 


Putting  (28)  and  (29)  together,  we  obtain 

£[IM  <  E  \e-2\\Vk-Vk\\\\EPk[T{X)}~  Edk[T{X)}\\  +  \\V^\\\\EPk[T(X)]  -  EPk[T(X)]\\ 


<  Me~2E 


E 


WVk-Vkiw^ 


+  e~lE 


E 


\EPk[T(X)}-EPk[T(X)]\\\Ek^ 


< 


Me  2  v/8||C'||2(l  +  log  (2d))  +  e  1  y7 d maxj  Cj 

VK 


where  the  positive  constant  M  is  due  to  the  boundedness  of  T(x)  on  X . 
Therefore,  for  any  T  >  0 


E 


£  0*116 


Li=k 


Yl^EiUi 


i=k 

oo 


*  cE 


Oti 


i=k 

oo 


VNl 


cE? 


i=k 


l 

iP 


,i  r°°  l  , 

<  c  -J  +  /  — jdx 


fc/3 


/fc 


1  1  1 

c  I  7a  + 


kP  j3  —  1  fc/3-1  /  ’ 

where  the  first  line  follows  from  the  monotone  convergence  theorem,  and  the  third  line 
follows  from  Assumption  1  (ii) .  For  any  r  >  0,  we  have  from  Markov’s  inequality 


p  a*H6ll  ^  r)  ^ 


\i=k 


T 


<  ~  \  Ta  + 


1 


1  1 


t  \kP  f3  —  1  kP  1 


0  as  k  — >  oo, 
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where  the  last  statement  is  due  to  (3  >  1.  This  result  of  convergence  in  probability  to¬ 
gether  with  the  fact  that  the  sequence  {ESU  Nl&lll  is  monotone  implies  that  the  sequence 
{ESU  Nl&lll  converges  w.p.l  as  k  ->•  oo.  Furthermore,  since  suP{r),:0<^"-fe1  ai<T}  II  Yn=k  a*£*ll  — 

sup{n:0<ErTfe  m<T}  E"=fc  Nl&ll  <  EN  ^116:11,  we  conclude  that  {sup{n:0<En-i  a,<T}  ||  E”=*  aM} 
converges  to  0  w.p.l  as  k  — >•  oo.  □ 


Proof.  Proof  of  Theorem  1.  To  show  our  theorem,  we  apply  Theorem  2.1  in  [11],  The 
condition  on  the  step  size  sequence  in  their  theorem  is  satisfied  by  our  Assumption  1  (i) , 
and  condition  (2.2)  there  is  a  result  of  Lemma  2.  Thus,  to  establish  convergence,  it  is 
sufficient  to  show  bk  — »  0  w.p.l  as  k  — >  oo.  Note  that 


Hence, 


h  =  V^[EPk[T(X)]-EPk[T(X 


r-i  ( Ufc  Ufc  |  Ufc  Ufc 

k  \%  Vfc  Vfc  Vfc 


=  L-iuh%VU+v;-'u‘  Dt 


Vfc¥fc 


INI  < 


IN  IIIINN  ,  ,  IN 


-ii 


I  Vfc  -  Vfc  I  + 
iVfcVfcl  |Vfc| 


IlUfc-Ufcll 


< 


IV, 


-1 


1 


N, 


IVfcVfcl  Nk  ^ 


J2\Sek(H(xi))-S6k(H(xi))\ 


IV, 


E  i4(#(4))  -  sek(H(xim\T(xi)\\. 


Nk 


Since  T(x)  is  bounded,  it  is  easy  to  see  that  VN  is  also  bounded.  Furthermore,  note 

that  HVNlI  is  bounded  and  |Vfc|  is  bounded  away  from  zero.  This  together  with  Assump¬ 
tion  l(iv)  imply  that  the  sequence  {bk}  converges  to  zero  w.p.l.  □ 


Proof.  Proof  of  Lemma  3.  Under  Assumption  1,  we  know  that  the  sequence  {0k} 
converges  w.p.l.  to  a  limiting  point  9*.  This,  together  with  Assumption  l(iii),  im¬ 
plies  that  the  sequence  of  sampling  distributions  {/(x;  9k)}  will  converge  point-wise  in 
x  to  a  limiting  distribution  f(x;9*)  w.p.l.  Note  that  ||Var gk(T(X))  —  Varg.  (T(X))  ||  < 
1 1  Var0fc  (T(X))  -  Var<9fc(T(X))||  +  ||Var0fc(T(X))  -  Var0»(T(X))||.  Clearly,  the  Rrst  term 
converges  to  zero  by  the  strong  consistency  of  the  variance  estimator.  On  the  other  hand, 
using  the  point-wise  convergence  of  {/(•;  9k)}  and  the  dominated  convergence  theorem,  it 
is  easy  to  see  that  the  second  term  also  vanishes  to  zero.  This  shows  <l?fc  — >  w.p.l.  Thus, 
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the  convergence  of  Tk  to  T  is  a  direct  consequence  of  the  continuity  assumption  of  Jc  in 
the  neighborhood  of  8*.  Regarding  Tk ,  we  have 


n  =  v/Wit  -  H*  +  H*  -  HP  +  k^JE,t  t-^I  -  Hi) 

\Yi.  V,.  Y,.  V  LVi.  J  Vi-/ 


=  fcT/2$fcl 

fUfc 

^Vfc  ' 

—  4,1  +  4,2  j 

feT/2^fcUfc 

(  Vfc  — 

V  w 

||4,i||  < 

11**11 

where  TM  =  and  Tk)2  =  [^k-i 

Note  that 


<  \\*k\\  J^F/2|vfc  -  vfc|  +  ||<M-^feT/2||ufc  -  ufc|| 

|VfcVfc|  |Vfc| 

<  II ^ II Ill'll  fc*_/2  ^  \se k(H(xi))  -  Sek(H( 4))| 

“  |VfcVfc|  iVfe  v  feV  v 

||<jj  ||  7,t/ 2  JV 

+  E  |4(^(4))  -  ^fc(JR(4))|||T(4)|| 

|Vfc|  /vfc  ,=1 


Since  T(x)  is  bounded,  it  is  easy  to  see  that  '-LM-  is  also  bounded.  Furthermore,  note 

that  |Vfc|  is  bounded  away  from  zero.  This,  together  with  the  boundedness  of  ||<5fc||  and 
Assumption  3,  imply  that  the  right-hand-side  of  (30)  converges  to  zero  w.p.l. 

For  term  7),.. 2,  let  U^,  and  U‘k  be  the  ith  components  of  Ufc  and  Ufc,  respectively.  By 

u*  ui 

using  a  second  order  two  variable  Taylor  expansion  of  -A  around  we  have 


Hi 

vfc 


ip  1  ip  ~  ip  _  1 

^  +  ^04  -  U*fc)  -  J(Vfc  -  Vfc)  +  ^(Vfc  -  Vfc)2  -  ^(U*fc  -  Ul)(Vk  -  Vfc), 
Vfc  Vfc  Yk  V]J  Y2k 


where  Ui  and  Yk  are  on  the  line  segments  from  IP,  to  Ui  and  from  Yk  to  Yk.  Taking 
conditional  expectations  at  both  sides  of  the  above  equation,  we  have 

rU*  1  fP  rllLPI  ~  i  r  1  _  i 

Edk  ^  J-fc-i  <Edk  ^(Vfc  -  Vfc)2  Fk.,  +  Edk  ^|(U*fc  -  U*fc)(Vfc  -  Vfc)|  ^ 

LVfc  J  Vfc  1  |V^|  1  LV2  1 

<  Cl Eek  [(Vfc  -  Vfc)2|jrfc_1j  +  C2Eek  [|(U*fc  -  4)(Vfc  -  Vfc)||^fc_1 

(31) 


for  constants  Ci  >  0  and  C2  >  0.  Thus,  a  straightforward  calculation  shows  that  the  right- 
hand-side  of  (31)  is  0{Nk1).  Consequently,  we  have  Tk  2  — >  0  w.p.l.  as  k  -»  00  by  taking 
Nk  =  N()k^  with  £  >  r/2.  This  shows  Tk  0  w.p.l.  as  desired.  □ 
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Proof.  Proof  of  Lemma  4.  E(jk  \Wk\Ek_\\  =  0  follows  directly  from  the  definition  of 
W k .  Again,  we  let  Uj(  and  U).  be  the  zth  components  of  Uk  and  U^,  let  Tjf  x)  be  the  ith 
component  of  the  sufficient  statistic  T(x),  and  dehne  as  the  (i,  j)th  entry  of  the  matrix 

Egk  \WkW^\Ek-\}.  By  using  a  hrst  order  two  variable  Taylor  expansion  of  around  y4, 

V  k  k 

we  have 


IT*  IT*  1  IT*  _ 

vf‘vT  +  v;(Ul"Ut)“v|(V‘“V‘)  +  KM 

where  lZk  is  a  reminder  term.  Therefore,  ■  can  be  expressed  as 


(32) 


r/Ut 

rUi 

1 

\  /U{ 

rUl 

1 

\ 

1 

U~E - 

k 

Lv k 

J~k—  1 

k 

LVk 

) 

¥)c_i 

=kT-a^E0k[( Vi  -  U*fc)(U{  -  U [i] 

id 

-  -  U|)(Vfe  -  Vfc)!^!]  [ii] 

IT*  .  .  _ 

-  kT-a^Eek[(Ui  -  U{)(Yk  -  ¥fc)|^_!]  [in] 

IT*  Id 

+  kT~a-^Egk[(yk  -  vfe)Vfc-i]  N 
+  kT~ank, 


where  7 Zk  represents  a  higher-order  term. 

[*]  =  kT~a^(Eek[ Uiuil^-r]  -Ului) 

=  [S0k(H(X))Ti(X)Tj(X)\j7k^1\  -UU4) 

,T_Q  1  u*fcu£\ 

Nk\  El[S0k(H(X))}  Vl  ) 


Nk 


E, 


Pk 


TiWTjiX) 


Pk(X) 


f(x-,ek)i 


~  EPk[Ti(X)]EPk[Tj(X)] 


By  using  a  similar  argument,  it  can  be  seen  that 


k 7 


\n\  = 


\in\  = 


iv  = 


Nk  L 
kT~a  r 
~Nk  L 

kT~a  r 

~Nk  L 


EPk[TAX)\EPk  [T^X) 
Epk[Tt(X)\EPk  \Tj(X) 


Pk(X) 


f(X;6k)l 
Pk{X) 


Epk  EPk  [Tj(X)]  E. 


f{x-ek)\ 

Pk(X ) 


pk 


f{x,ek)\ 


-  Ep^X^Ep^TjiX)} 

-  Epk[UX)]Epk[T3{X)) 

~  EPkmX)\EPk[Tj{X)\ 
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Therefore, 


Sfj  =  [i]  -  [ii]  -  [in]  +  [iv]  +  kT  a7 Zk 

=  J^Epk  [{Tl(x)  ~  E'pdTr(X)m(X)  -  E^TjiX)])^^]  +  V-«nk 
=  k~xVEdk  [{Ti{x)  ~  Epk^xMTAX)  -  Epk[Tj(x)])  ff^lj  +  kT~ank. 

By  taking  Nk  =  NokT~a,  it  can  be  shown  that  the  higher-order  term  kT~a1Zk  is  o(l).  In 
addition,  since  Sg(y)  is  continuous  in  6  for  a  fixed  y,  the  point-wise  convergence  of  /(•;  0k) 
to  /(•;  6*)  implies  that  pk(x )  will  also  converge  in  a  point- wise  manner  to  a  limiting  distri¬ 
bution  p*(x).  Thus,  the  dominated  convergence  theorem  suggests  that  will  converge 
to 

£„■  =  CEg*  [mx)  -  E^TiiXMTjiX)  -  Ep.[Tj(X) D^^y’ 
for  some  positive  constant  C.  Therefore,  the  limiting  matrix  £  is  given  by 

E  =  Cov„, ((T(.Y)  -  Ep,{T(X)])XXL.), 

where  Cov$*(-)  is  the  covariance  matrix  with  respect  to  /(•;#*). 

To  show  the  last  statement,  we  use  Holder’s  inequality  and  write 

lim  E[I{\\Wk\\2  >  rka}\\Wk\\2]  <  limsup  \p{\\Wk\\2  >  rka)  1  5  |h[||B4||4]  1  \  (33) 

k— >oo  fc— >oo  J  L  . 

Note  that 

P{\\Wkf  >  rka )  =  P{\\Wk\\  >  V^“/2) 

E\\\Wk\\2] 

<  - - -  by  Chebyshev’s  inequality 

rka 

_E[Egk[\\Wk\\2\Ek_l]] 

rka 

_  H[tr(£fc)] 
rka 
=  0(k~a ) 

by  taking  Nk  =  NokT~a  for  k  sufficiently  large,  where  the  last  step  follows  because  all 
entries  in  £fc  are  bounded  and  thus  convergence  w.p.l.  implies  convergence  in  expectation. 
On  the  other  hand,  by  (32),  -E'[||B/fc||4]  can  be  expressed  in  terms  of  the  fourth  order  central 
moments  of  the  sample  mean  and  it  can  be  verified  that  ^[HWfcH4]  =  0(1)-  This  shows 
that  the  right-hand-side  of  (33)  is  0(k~  2),  which  vanishes  to  zero  as  k  — >  00.  □ 
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Figure  1:  Comparison  of  GASS,  GASS_avg,  modified  CE  and  MRAS 
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50-D  Pinter  function 


Figure  2:  Comparison  of  GASS,  GASS_avg,  modified  CE  and  MRAS 
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